Overview
- Reference genomes and GRC.
- Fasta and FastQ (Unaligned sequences).
- SAM/BAM (Aligned sequences).
- BED (Genomic Intervals).
- GFF/GTF (Gene annotation).
- Wiggle files, BEDgraphs and BigWigs (Genomic scores).
- VCF and MAF (Genomic variations).
Are we there yet?
- The human genome isn't complete!
- In fact, most model organisms' reference genomes are being regularly updated.
- Reference genomes consist of a mixture of known chromosomes and unplaced contigs, together called a Genome Reference Assembly.
- Major revisions to assemblies result in a change of coordinates.
- This requires conversion of coordinates between revisions.
- The latest genome assembly for humans is GRCh38.
- Patches add information to the assembly without disrupting the chromosome coordinates, e.g. GRCh38.p3.
Why do we need to know about reference genomes?
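The distinction between major revisions and patches can be sketched in code. The example below is a minimal illustration, assuming assembly names follow the `GRCh38.p3` pattern shown above; the function names `parse_assembly` and `needs_coordinate_conversion` are hypothetical helpers, not part of any library.

```python
import re

def parse_assembly(name):
    """Split a name such as 'GRCh38.p3' into (major release, patch number).

    Assumes the 'GRC<species letters><number>[.p<patch>]' naming pattern;
    a name without a patch suffix is treated as patch 0.
    """
    match = re.fullmatch(r"(GRC[a-z]+\d+)(?:\.p(\d+))?", name)
    if match is None:
        raise ValueError(f"unrecognised assembly name: {name!r}")
    major, patch = match.groups()
    return major, int(patch) if patch is not None else 0

def needs_coordinate_conversion(assembly_a, assembly_b):
    """True when the two names belong to different major releases.

    Patches of the same major release share chromosome coordinates,
    so only the major part is compared.
    """
    return parse_assembly(assembly_a)[0] != parse_assembly(assembly_b)[0]

print(parse_assembly("GRCh38.p3"))
print(needs_coordinate_conversion("GRCh37", "GRCh38.p3"))
print(needs_coordinate_conversion("GRCh38", "GRCh38.p3"))
```

The conversion itself needs a dedicated tool and mapping files; this sketch only decides whether conversion is required.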
- Allows for genes and genomic features to be evaluated in their linear genomic context.
- Gene A is close to Gene B
- Gene A and Gene B are within feature C.
- Can be used to align shallow targeted high-throughput sequencing to a pre-built map of an organism's genome.
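The "close to" and "within" relationships above become simple arithmetic once features share a coordinate system. The sketch below assumes 0-based, half-open intervals (one common convention); the `Interval` type, the helper names, and all coordinates are hypothetical.

```python
from typing import NamedTuple, Optional

class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (assumed convention)
    end: int    # exclusive

def distance(a: Interval, b: Interval) -> Optional[int]:
    """Gap in bases between two features; 0 if they overlap or touch.

    Returns None when the features are on different contigs, because
    linear distance is undefined in that case.
    """
    if a.chrom != b.chrom:
        return None
    return max(0, max(a.start, b.start) - min(a.end, b.end))

def contains(outer: Interval, inner: Interval) -> bool:
    """True when inner lies entirely within outer on the same contig."""
    return (outer.chrom == inner.chrom
            and outer.start <= inner.start
            and inner.end <= outer.end)

# Made-up coordinates for illustration only.
gene_a = Interval("chr1", 1000, 2000)
gene_b = Interval("chr1", 2500, 4000)
feature_c = Interval("chr1", 500, 5000)

print(distance(gene_a, gene_b))
print(contains(feature_c, gene_a) and contains(feature_c, gene_b))
```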
A reference genome
- A reference genome is a collection of contigs.
- A contig is a stretch of DNA sequence encoded as A,G,C,T,N.
- Typically comes in FASTA format.
- “>” line contains information on contig
- Lines following contain contig sequence
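The FASTA layout described above can be read with a few lines of code. This is a minimal sketch that assumes plain, uncompressed text; the function name `parse_fasta` and the example contigs are hypothetical.

```python
def parse_fasta(lines):
    """Yield (header, sequence) for each contig in an iterable of FASTA lines.

    The header is the text after '>'; the sequence lines that follow are
    joined into a single string.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Made-up contigs for illustration only.
example = """>chr1 hypothetical contig
ACGTNNAC
GGTA
>chrUn_1 hypothetical unplaced contig
TTGA
""".splitlines()

contigs = dict(parse_fasta(example))
for name, sequence in contigs.items():
    print(name, len(sequence))
```

For whole genomes, an indexed reader is usually preferable to loading every contig into memory.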
Unaligned Sequences
FastQ - Header
- Header for each read can contain additional information
- HS2000-887_89 - Machine name.
- 5 - Flowcell lane.
- /1 - Read 1 or 2 of pair (here read 1)
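The header fields above can be pulled apart programmatically. The sketch below assumes a colon-separated header in which the machine name comes first and the lane second, with the read number after a trailing "/"; the remaining fields in the example header are invented, and real headers vary by instrument and software version. The function name `parse_read_header` is hypothetical.

```python
def parse_read_header(header):
    """Extract machine name, lane and read number from a FASTQ header line.

    Assumes '@<machine>:<lane>:...[/<read number>]'; other layouts will
    need a different parser.
    """
    name = header.strip().lstrip("@")
    read_number = None
    if "/" in name:
        # Text after the last '/' is the read number within the pair.
        name, _, mate = name.rpartition("/")
        read_number = int(mate)
    fields = name.split(":")
    return {
        "machine": fields[0],
        "lane": int(fields[1]),
        "read_number": read_number,
    }

# Only the machine name, lane and '/1' come from the slide; the rest is made up.
info = parse_read_header("@HS2000-887_89:5:1101:10000:20000/1")
print(info)
```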
Aligned sequences
SAM format
- SAM - Sequence Alignment/Map.
- Standard format for aligned sequence data.
- Recognised by majority of software and browsers.
Aligned sequences
SAM - Aligned reads
- Contains read and alignment information and location
Aligned sequences
SAM - Aligned reads
- Read name.
- Sequence of read.
- Encoded sequence quality.
Aligned sequences
SAM - Aligned reads
- Chromosome to which read aligns.
- Leftmost position in chromosome to which read aligns (the 5’ end for reads on the forward strand).
- Alignment information - “Cigar string”.
- 100M - Continuous match of 100 bases
- 28M1D72M - 28 bases continuously match, 1 deletion from reference, 72 base match
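The two CIGAR strings above can be decoded with a short parser. This is a minimal sketch; the helper names are hypothetical, and the sets of operations that consume reference or read bases follow the SAM specification.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")  # operations that advance along the reference
CONSUMES_READ = set("MIS=X")       # operations that advance along the read

def parse_cigar(cigar):
    """Return a list of (length, operation) pairs, e.g. [(28, 'M'), (1, 'D'), (72, 'M')]."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Rebuild the string to make sure nothing was silently skipped.
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops

def reference_length(cigar):
    """Number of reference bases spanned by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)

def read_length(cigar):
    """Number of read bases described by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_READ)

for cigar in ("100M", "28M1D72M"):
    print(cigar, parse_cigar(cigar), reference_length(cigar), read_length(cigar))
```

The deletion in 28M1D72M means a 100-base read spans 101 bases of the reference.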
Aligned sequences
SAM - Aligned reads
- Bit flag - TRUE/FALSE for pre-defined read criteria
- Paired read position and insert size
- User-defined tags (optional fields).
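The bit flag packs several TRUE/FALSE answers into one integer, and each can be recovered with a bitwise AND. The sketch below uses the bit values from the SAM specification; the short names in `FLAG_BITS` and the function `decode_flag` are hypothetical labels chosen for this example.

```python
# Bit values as defined in the SAM specification; names are illustrative.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

def decode_flag(flag):
    """Return the names of all criteria that are TRUE for this flag value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

print(decode_flag(99))  # 99 = 1 + 2 + 32 + 64
print(decode_flag(4))
```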
Genomic Annotation.
Genomic Annotation
- Chromosome
- Start of feature
- End of Feature
- Strand
Genomic Annotation
- Column 9 contains key-value pairs (ID=exon01), separated by semicolons “;”
- ID - Feature name.
- Parent - Meta-feature name.
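The annotation columns and the column 9 attributes can be parsed as follows. This is a minimal sketch assuming a tab-separated, nine-column GFF3 line; the function name `parse_gff3_line` and every value in the example line other than ID=exon01 are hypothetical.

```python
def parse_gff3_line(line):
    """Split one GFF3 feature line into named fields and an attribute dict."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqid, source, feature_type, start, end, score, strand, phase, attrs = cols

    # Column 9: 'key=value' pairs separated by semicolons.
    attributes = {}
    for pair in attrs.split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[key.strip()] = value.strip()

    return {
        "chrom": seqid,
        "type": feature_type,
        "start": int(start),  # GFF coordinates are 1-based, inclusive
        "end": int(end),
        "strand": strand,
        "attributes": attributes,
    }

# Made-up feature line; only ID=exon01 comes from the slide.
line = "chr1\texample\texon\t1300\t1500\t.\t+\t.\tID=exon01;Parent=mRNA01"
feature = parse_gff3_line(line)
print(feature["chrom"], feature["start"], feature["end"], feature["strand"])
print(feature["attributes"])
```

GTF uses a different attribute syntax in column 9, so this parser applies to GFF3 only.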
Genomic Variants
- Variant Call Format (VCF)
- Mutation Annotation Format (MAF)
Genomic Files for computing.